About Cyclistic

Cyclistic is a hypothetical bike-sharing company based in Chicago. Cyclistic has two membership types, casual and annual. In the past, Cyclistic’s marketing team has been targeting new members in order to increase the awareness of the company within communities. Now, the marketing team believes that the next step to take in order to grow the company to the next level is to convert the casual members into annual members.

The Goal of This Project

In order to convert the casual members into annual members, I decided to analyze the difference in bike usage between the two member types and use the discovered insights to build effective strategies for the conversion. To do so, I asked and answered several questions.

Questions to Answer

  1. What is the average ride duration of each member type?
  2. How far do each member type travel on average?
  3. On which day of the week does each member type prefer to ride the bikes?
  4. How does the usage of the bikes change throughout the day for each member type?
  5. What is each member type’s seasonal preference for riding the bikes?
  6. What is each member type’s bike type preference?
  7. Where are some of the popular areas for riding the bikes for each member type?

Preparing the Data

In order to answer the questions, I downloaded one year worth of the Cyclistic members’ trip data (July 2021 - June 2022) from the Cyclistic database. The dataset can be downloaded from here.

Since the dataset has a total of more than a million rows, I decided to use SQL as my main tool for the analysis. However, before using SQL, I quickly inspected the data with Excel, checking things such as the attributes, the data layout, the data size and potential room for error.

Then, I created a table in PostgreSQL and imported the 12 csv files into the table.

-- Creating the table for the csv files.

CREATE TABLE cyclistic.trip_data (
    ride_id text NULL,
    rideable_type VARCHAR(15) NULL,
    started_at timestamp NULL,
    ended_at timestamp NULL,
    start_station_name varchar(100) NULL,
    start_station_id varchar(100) NULL,
    end_station_name varchar(100) NULL,
    end_station_id varchar(100) NULL,
    start_lat float NULL,
    start_lng float NULL,
    end_lat float NULL,
    end_lng float NULL,
    member_casual char(6) NULL
);

-- Importing csv files into the 'trip_data' table. Do this for all 12 csv files.

COPY trip_data(
    ride_id,
    rideable_type,
    started_at,
    ended_at,
    start_station_name,
    start_station_id,
    end_station_name,
    end_station_id,
    start_lat,
    start_lng,
    end_lat,
    end_lng,
    member_casual
)
FROM '/Users/bruce/Desktop/case_studies/cyclistic/cyclistic_data_original/cyclistic_data_202107.csv'
DELIMITER ','
CSV HEADER;

With this initial setup, I dived into answering the questions.

Question 1: What is the average ride duration of each member type?

In order to find the average ride duration for each member type, I first had to calculate the ride duration of each trip. I did this by subtracting the ‘started_at’ column from the ‘ended_at’ column.

CREATE TEMP TABLE trip_data_v01 AS(
  SELECT *, (ended_at - started_at) AS ride_duration
  FROM trip_data
);

Upon inspection, I found some negative values in the ‘ride_duration’ column, which were errors, since there cannot be negative time value in ride duration. So I deleted the rows with negative ride duration values from the table.

DELETE FROM trip_data_v01
WHERE CAST(ride_duration AS text) LIKE '-%';

Then I queried to get the average ride duration of each member type.

-- Finding the average ride duration for each member type.

SELECT AVG(ride_duration) AS avg_ride_duration, member_casual AS member_type
FROM cyclistic.trip_data_v01
GROUP BY member_casual;
2 records
avg_ride_duration member_type
00:29:49.144628 casual
00:12:58.955735 member

By answering Question 1, I learned that, on average, the causal members rode the bikes more than twice as longer as the annual members.

Question 2: How far does each member type travel on average?

Inspecting the dataset, I realized that I did not have the data for the actual distance traveled on the bike. However, I did have the latitude and the longitude values of the start and the end of points of the ride. So, I decided to use the distance between the start and the end points as the metrics for the distance traveled.

In order to calculate the distance between the two points using the latitude and the longitude values, I used the extension cube and earthdistance.

CREATE EXTENSION cube;         
CREATE EXTENSION earthdistance;
-- Finding the average distance between start and end points for each member type.

SELECT 
    AVG((point(start_lng, start_lat) <@> point(end_lng, end_lat))) FILTER(WHERE member_casual = 'casual') AS casual_start_end_distance_in_mi,
    AVG((point(start_lng, start_lat) <@> point(end_lng, end_lat))) FILTER(WHERE member_casual = 'member') AS annual_start_end_distance_in_mi
FROM cyclistic.trip_data_v01;
1 records
casual_start_end_distance_in_mi annual_start_end_distance_in_mi
1.411926 1.297408

By answering Question 2, I learned that there wasn’t much difference in the average distances between the start and end points of the ride of the casual members and the annual members. On average, the casual members traveled just 8 percent further than the annual members.

Question 3: On which day of the week does each member type prefer to ride the bikes?

In order to figure out on which day the members preferred to ride the bikes, I needed to assign the appropriate day to every observation in the dataset. So I created a new column named “day.” Then, I queried the new table and converted the output values to percentages for fair comparison between the two member types.

-- Creating a temp table with 'day' columnn.

CREATE TEMP TABLE trip_data_v02 AS -- Creating a temp table with the 'day' columnn.
SELECT *, to_char(started_at, 'Day') AS day
FROM trip_data_v01;
-- Finding each member type's day preference for using the bike.

SELECT  member_casual AS member_type, day, COUNT(day)
FROM cyclistic.trip_data_v02
GROUP BY day, member_casual
ORDER BY member_casual, day;
14 records
member_type day count
casual Friday 363100
casual Monday 303997
casual Saturday 535565
casual Sunday 467081
casual Thursday 325171
casual Tuesday 277765
casual Wednesday 285483
member Friday 469253
member Monday 469039
member Saturday 444150
member Sunday 398919
member Thursday 525932
member Tuesday 518239
member Wednesday 516545

By answering Question 3, I learned that the casual members used the bikes more on the weekends, whereas the annual members used the bikes more during the weekdays, but pretty steadily throughout the whole week.

Question 4: How does the usage of bike change throughout the day for each member type?

For this question, I used hours as the metrics and defined the hours by counting every ride 30 minutes before and 29 minutes after the actual hours. For example, “12AM” was defined by counting every ride between 11:30 PM and 12:29 AM. Then, again, the output values were converted to percentages for fair comparison.

-- Finding the usage of bike per time of the day for each member type.

WITH temp_table AS(
    SELECT member_casual, CAST(started_at AS time)
    FROM cyclistic.trip_data_v02
)
SELECT member_casual AS member_type,
COUNT(started_at) FILTER (WHERE started_at >= ('23:30:00'::time) OR started_at <('00:30:00'::time)) AS "12AM",
COUNT(started_at) FILTER (WHERE started_at >= ('00:30:00'::time) AND started_at <('01:30:00'::time)) AS "1AM",
COUNT(started_at) FILTER (WHERE started_at >= ('01:30:00'::time) AND started_at <('02:30:00'::time)) AS "2AM",
COUNT(started_at) FILTER (WHERE started_at >= ('02:30:00'::time) AND started_at <('03:30:00'::time)) AS "3AM",
COUNT(started_at) FILTER (WHERE started_at >= ('03:30:00'::time) AND started_at <('04:30:00'::time)) AS "4AM",
COUNT(started_at) FILTER (WHERE started_at >= ('04:30:00'::time) AND started_at <('05:30:00'::time)) AS "5AM",
COUNT(started_at) FILTER (WHERE started_at >= ('05:30:00'::time) AND started_at <('06:30:00'::time)) AS "6AM",
COUNT(started_at) FILTER (WHERE started_at >= ('06:30:00'::time) AND started_at <('07:30:00'::time)) AS "7AM",
COUNT(started_at) FILTER (WHERE started_at >= ('07:30:00'::time) AND started_at <('08:30:00'::time)) AS "8AM",
COUNT(started_at) FILTER (WHERE started_at >= ('08:30:00'::time) AND started_at <('09:30:00'::time)) AS "9AM",
COUNT(started_at) FILTER (WHERE started_at >= ('09:30:00'::time) AND started_at <('10:30:00'::time)) AS "10AM",
COUNT(started_at) FILTER (WHERE started_at >= ('10:30:00'::time) AND started_at <('11:30:00'::time)) AS "11AM",
COUNT(started_at) FILTER (WHERE started_at >= ('11:30:00'::time) AND started_at <('12:30:00'::time)) AS "12PM",
COUNT(started_at) FILTER (WHERE started_at >= ('12:30:00'::time) AND started_at <('13:30:00'::time)) AS "1PM",
COUNT(started_at) FILTER (WHERE started_at >= ('13:30:00'::time) AND started_at <('14:30:00'::time)) AS "2PM",
COUNT(started_at) FILTER (WHERE started_at >= ('14:30:00'::time) AND started_at <('15:30:00'::time)) AS "3PM",
COUNT(started_at) FILTER (WHERE started_at >= ('15:30:00'::time) AND started_at <('16:30:00'::time)) AS "4PM",
COUNT(started_at) FILTER (WHERE started_at >= ('16:30:00'::time) AND started_at <('17:30:00'::time)) AS "5PM",
COUNT(started_at) FILTER (WHERE started_at >= ('17:30:00'::time) AND started_at <('18:30:00'::time)) AS "6PM",
COUNT(started_at) FILTER (WHERE started_at >= ('18:30:00'::time) AND started_at <('19:30:00'::time)) AS "7PM",
COUNT(started_at) FILTER (WHERE started_at >= ('19:30:00'::time) AND started_at <('20:30:00'::time)) AS "8PM",
COUNT(started_at) FILTER (WHERE started_at >= ('20:30:00'::time) AND started_at <('21:30:00'::time)) AS "9PM",
COUNT(started_at) FILTER (WHERE started_at >= ('21:30:00'::time) AND started_at <('22:30:00'::time)) AS "10PM",
COUNT(started_at) FILTER (WHERE started_at >= ('22:30:00'::time) AND started_at <('23:30:00'::time)) AS "11PM"
FROM temp_table
GROUP BY member_casual;
2 records
member_type 12AM 1AM 2AM 3AM 4AM 5AM 6AM 7AM 8AM 9AM 10AM 11AM 12PM 1PM 2PM 3PM 4PM 5PM 6PM 7PM 8PM 9PM 10PM 11PM
casual 61936 42461 31424 17758 11233 10005 18892 39396 64379 73464 91575 121560 153593 165915 174302 184622 200081 228076 231799 194049 142092 110909 101729 86912
member 45898 27641 18455 9705 7533 15502 58520 129186 199778 175106 135886 149397 186537 190571 186074 197941 247056 331833 321801 247004 170438 123980 96233 70002

By answering Question 4, I learned that, for the casual members, there was a steady rise in the use of the bikes from 5AM to 6PM and a steady fall after that. On the other hand, for the annual members, there were two noticeable peaks at 8AM and 5PM.

Question 5: What is each member type’s seasonal preference for riding the bikes?

The four seasons, spring, summer, fall and winter, were the metrics that I used to answer this question. The four seasons were defined as such: Spring, March to May, Summer, June to August, Fall, September to November, and Winter, December to February.

--Finding each member type's seasonal preference for riding the bikes.

WITH temp_table AS(
    SELECT member_casual, CAST(started_at AS date)
    FROM cyclistic.trip_data_v02
)
SELECT member_casual AS member_type,
       COUNT(started_at) FILTER (WHERE started_at >= ('2022-03-01'::date) AND started_at <= ('2022-05-31'::date)) AS "spring",
       COUNT(started_at) FILTER (WHERE started_at >= ('2022-06-01'::date) OR started_at <= ('2021-08-31'::date)) AS "summer",
       COUNT(started_at) FILTER (WHERE started_at >= ('2021-09-01'::date) AND started_at <= ('2021-11-30'::date)) AS "fall",
       COUNT(started_at) FILTER (WHERE started_at >= ('2021-12-01'::date) AND started_at <= ('2022-02-28'::date)) AS "winter"
FROM temp_table
GROUP BY member_casual;
2 records
member_type spring summer fall winter
casual 496711 1223754 728023 109674
member 793435 1172158 1019239 357245

By answering Question 5, I learned that the order of the seasonal popularity for riding the bikes were the same for the two member types. It was summer, fall, spring and winter in the order of most popular to least popular. However, I noticed that the annual members had more evenly distributed seasonal preferences for riding the bikes than the casual members

Question 6: What is each member type’s bike type preference?

There were 3 bike types for the members to choose from: classic bike, electric bike and docked bike.

-- Finding bike type preferences for each member type.

SELECT member_casual AS member_type, rideable_type, COUNT(rideable_type)
FROM cyclistic.trip_data_v02
GROUP BY rideable_type, member_casual;
5 records
member_type rideable_type count
casual classic_bike 1218261
member classic_bike 1971040
casual docked_bike 253366
casual electric_bike 1086535
member electric_bike 1371037

By answering Question 6, I learned that the annual members do not ride docked bikes. I also learned that both member types preferred classic bikes to electric bikes. However, electric bikes were still a popular option.

Combining the Insights

Strategy Recommendations

Based on the data analysis, I came up with strategies for converting the casual members into annual members:

  1. Concentrate the advertisements along the coastline because that is where the casual members gather.
  2. Offer promotions or coupons that exempt the casual members from having to pay the initial fee, along with an advertisement that informs that the annual members do not have to pay the initial fee. This is to allow the casual members to experience, feel and see the benefits of being an annual member.
  3. Advertise during the rush hours in traffic heavy areas. Emphasize how the bikes are not only good for leisure, but also for conveniently and efficiently transporting from one place to another, and also emphasize how the annual members have to pay once to get unlimited access to using the bikes.